Abstract

We consider heteroscedastic nonparametric regression models in which
both the mean function and the variance function are unknown and are
estimated nonparametrically. We derive convergence rates of the posterior
distributions for this model under different priors, including spline and
Gaussian process priors. The results build on general theorems on the rates
of convergence of posterior distributions for independent, non-identically
distributed observations, and are established for both random and
deterministic covariates. We also show that the rates are attained for all
levels of regularity, which means they are adaptive.

1 Introduction 



The posterior distribution is said to be consistent if the posterior
probability of any small neighborhood of the true parameter value converges
to one. In recent years, many results giving conditions under which the
posterior distribution is consistent have appeared, especially for
infinite-dimensional parameter spaces. For example, Barron et al. [1] gave
necessary and sufficient conditions for posterior consistency, and the
results were then specialized from Kullback-Leibler neighborhoods to weak
and $L_1$ neighborhoods. For details, we refer the reader to [1], [2]. The
consistency of posterior distributions in nonparametric Bayesian inference
has received a lot of attention ever since 1986, when Diaconis and Freedman
gave a counterexample showing that Bayesian methods can sometimes fail.
On the positive side, consistency has been demonstrated for many models.



In nonparametric Bayesian analysis, we have an independent sample
$Y_1, \dots, Y_n$ from a distribution $P_0$ with density $p_0$ with respect
to some measure on the sample space $(\mathcal{Y}, \mathcal{B})$. The model
space is denoted by $\mathcal{P}$, which is known to contain the true
distribution $P_0$. Given some prior distribution $\Pi$ on $\mathcal{P}$,
the posterior is a random measure given by

$$\Pi_n(A \mid Y_1, \dots, Y_n) = \frac{\int_A \prod_{i=1}^n p(Y_i)\, d\Pi(P)}{\int_{\mathcal{P}} \prod_{i=1}^n p(Y_i)\, d\Pi(P)}.$$

For ease of notation, we will omit the explicit conditioning and write
$\Pi_n(A)$ for the posterior distribution. We say that the posterior is
consistent if

$$\Pi_n(P : d(P, P_0) > \epsilon) \to 0$$

for any $\epsilon > 0$, where $d$ is some suitable distance function between
probability measures.

Furthermore, rates of convergence are also of interest. We say the rate is
at least $\epsilon_n$ if, for a sufficiently large constant $M$,

$$\Pi_n(P : d(P, P_0) \ge M \epsilon_n) \to 0,$$

where $\epsilon_n$ is a positive sequence decreasing to zero. Ghosal and van
der Vaart [12] presented general results on the rates of convergence of the
posterior measure, and [13] then generalized these results to observations
that are not i.i.d., which is useful for the model considered in this
article.

For Bayesian nonparametric regression models, one common approach is a
spline basis expansion of the regression function. Using this approach,
Ghosal and van der Vaart [12] gave the posterior convergence rate for a
regression model with an unknown mean function and normally distributed
errors with zero mean and known variance $\sigma^2$. Choi and Schervish [14]
provided sufficient conditions for posterior consistency in nonparametric
regression problems with homogeneous Gaussian errors of unknown level, by
constructing tests that separate the true parameter from the outside of
suitable neighborhoods of it. Amewou-Atisso, Ghosal, Ghosh and Ramamoorthi
[15] presented a posterior consistency analysis for linear regression
problems with an unknown error distribution that is symmetric about zero.
However, neither of these two papers considered rates of convergence.

In this paper, we give the convergence rates for heteroscedastic
nonparametric regression models, where nonparametric methods are used to
estimate the unknown variance function and the unknown mean function
simultaneously. As in [2], we deal with two types of covariates: either
randomly sampled from a probability distribution or fixed in advance. When
the covariates are one-dimensional, we use a spline basis expansion of the
regression functions and give the convergence rate. For higher-dimensional
cases, we use rescaled smooth Gaussian fields as priors for multidimensional
functions to obtain the result.

Using Gaussian processes is another common approach in Bayesian
nonparametric analysis. In the context of density estimation it was first
used by Leonard [16] and Lenk [17]. Recently, many results on posterior
consistency have been obtained for Gaussian process priors, such as in [18]
and [19]. Van der Vaart and van Zanten [18] derived the rates of contraction
of posterior distributions for nonparametric or semiparametric models based
on Gaussian processes, and showed that the rates depend on the position of
the true parameter relative to the reproducing kernel Hilbert space of the
Gaussian process and on the small ball probabilities of the Gaussian
process. With rescaled smooth Gaussian fields as priors, they [20] extended
the results to be fully adaptive to the smoothness.

The rest of the paper is organized as follows. In Section 2, we describe the
regression model. In Section 3, we give the main results on the posterior
convergence rates with the spline basis expansion approach and the Gaussian
process approach. Section 4 contains the proofs, with some lemmas deferred
to the Appendix. We discuss the results and some directions for future work
in Section 5.

2 The model 

We consider the heteroscedastic nonparametric regression model, in which a
random response $y$ corresponds to a covariate vector $x$ taking values in a
compact set $T \subset \mathbb{R}^d$; without loss of generality, we assume
that $T = [0,1]^d$. To be specific, the regression model we consider here is
the following:

$$y = \eta(x) + V^{1/2}(x)\,\epsilon, \qquad \epsilon \sim N(0,1),$$
$$\eta(\cdot) \sim \Pi_1,$$
$$f(\cdot) = \log V(\cdot) \sim \Pi_2,$$

where $\eta(\cdot)$ is the mean function and $V(\cdot)$ is the variance
function. Let $\Theta^{(g)}$ be the abstract measurable space to which the
function $g$ belongs (here $g$ stands for $\eta$ or $f$), with respect to a
common $\sigma$-finite measure. Assuming that $\Theta^{(\eta)}$ and
$\Theta^{(f)}$ are independent, we define the joint parameter space $\Theta$
as the product space of $\Theta^{(\eta)}$ and $\Theta^{(f)}$. Since the
parameter space is infinite-dimensional, we consider a sieve $\Theta_n$
growing eventually to $\Theta$, with
$\Theta_1 \subset \cdots \subset \Theta_n \subset \cdots$ and
$\bigcup_n \Theta_n = \Theta$. We model the unknown functions $\eta(\cdot)$
and $f(\cdot)$ with suitable prior distributions $\Pi_n^{(\eta)}$ and
$\Pi_n^{(f)}$ on their respective sieve spaces.



3 The main results 

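In this section, we give the rates of convergence for the nonparametric
regression model described in Section 2. The parameter is
$\theta = (\eta, V)$, with $\theta_0 = (\eta_0, V_0)$ being the true
functions.

Let $P_{\eta,V}$ be the distribution of $y$. To be specific, for our model,
given $x$, $P_{\eta,V}$ has the normal density

$$p_{\eta,V}(y \mid x) = \frac{1}{\sqrt{2\pi V(x)}} \exp\Big(-\frac{(y - \eta(x))^2}{2V(x)}\Big).$$

We use $d_n^2$ to denote the squared Hellinger distance. That is, for both
random covariates and fixed covariates,

$$d_n^2(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = \int\!\!\int \big(\sqrt{p_{\eta_1,V_1}} - \sqrt{p_{\eta_2,V_2}}\big)^2\, dy\, dQ(x). \qquad (3)$$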



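For random covariates, $Q(x)$ denotes the distribution function of $x$, and
for fixed covariates, it is the empirical probability measure of the design
points, which is defined by $P_n^x = n^{-1} \sum_{i=1}^n \delta_{x_i}$.

The Kullback-Leibler divergence and variance divergence of
$P_{\eta_1,V_1}$ and $P_{\eta_2,V_2}$ for fixed $x$ are defined in the
following way:

$$K_x(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = \int p_{\eta_1,V_1} \log\Big(\frac{p_{\eta_1,V_1}}{p_{\eta_2,V_2}}\Big)\, dy;$$

$$\mathrm{Var}_x(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = \int p_{\eta_1,V_1} \Big(\log\Big(\frac{p_{\eta_1,V_1}}{p_{\eta_2,V_2}}\Big) - K_x(P_{\eta_1,V_1}, P_{\eta_2,V_2})\Big)^2 dy.$$

For the specific model in Section 2:

$$K_x(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = \frac{1}{2}\log\frac{V_2(x)}{V_1(x)} - \frac{1}{2}\Big(1 - \frac{V_1(x)}{V_2(x)}\Big) + \frac{(\eta_1(x) - \eta_2(x))^2}{2V_2(x)},$$

$$\mathrm{Var}_x(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = 2\Big[-\frac{1}{2} + \frac{1}{2}\frac{V_1(x)}{V_2(x)}\Big]^2 + \Big[\frac{\sqrt{V_1(x)}}{V_2(x)}\big(\eta_1(x) - \eta_2(x)\big)\Big]^2.$$

Correspondingly, the average Kullback-Leibler divergence and variance
divergence take the forms

$$K(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = \int K_x(P_{\eta_1,V_1}, P_{\eta_2,V_2})\, dQ(x);$$
$$\mathrm{Var}(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = \int \mathrm{Var}_x(P_{\eta_1,V_1}, P_{\eta_2,V_2})\, dQ(x). \qquad (6)$$

In the remainder of the article, $\|\cdot\|_n$ stands for the norm on
$L_2(Q)$, and $\|\cdot\|_\infty$ denotes the supremum norm.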
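The closed forms of $K_x$ and $\mathrm{Var}_x$ for the normal model can be
checked numerically. The following is a minimal sketch, not part of the
original derivation: the function names, the midpoint quadrature rule and
its step count are illustrative assumptions, and the parameter values in any
call are arbitrary.

```python
import math

def kl_normal(eta1, v1, eta2, v2):
    """Closed-form K_x between N(eta1, v1) and N(eta2, v2); v1, v2 are variances."""
    return (0.5 * math.log(v2 / v1) - 0.5 * (1.0 - v1 / v2)
            + (eta1 - eta2) ** 2 / (2.0 * v2))

def var_normal(eta1, v1, eta2, v2):
    """Closed-form Var_x of the log-likelihood ratio under N(eta1, v1)."""
    return (2.0 * (-0.5 + 0.5 * v1 / v2) ** 2
            + (math.sqrt(v1) / v2 * (eta1 - eta2)) ** 2)

def numeric_kl_var(eta1, v1, eta2, v2, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of K_x and Var_x (hypothetical helper)."""
    sd = math.sqrt(v1)
    lo = eta1 - half_width * sd
    h = 2.0 * half_width * sd / steps
    ys = [lo + (i + 0.5) * h for i in range(steps)]

    def log_ratio(y):
        # log(p1 / p2), written out analytically to avoid log(0) in the tails
        return (0.5 * math.log(v2 / v1)
                - (y - eta1) ** 2 / (2.0 * v1)
                + (y - eta2) ** 2 / (2.0 * v2))

    # quadrature weights: density of N(eta1, v1) times the step size
    w = [math.exp(-(y - eta1) ** 2 / (2.0 * v1)) / math.sqrt(2.0 * math.pi * v1) * h
         for y in ys]
    k = sum(wi * log_ratio(y) for wi, y in zip(w, ys))
    var = sum(wi * (log_ratio(y) - k) ** 2 for wi, y in zip(w, ys))
    return k, var
```

Calling `numeric_kl_var(0.3, 1.5, -0.2, 0.8)` and comparing with
`kl_normal` and `var_normal` at the same arguments should give agreement up
to quadrature error.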



3.1 Splines 

In this section we give the convergence rates for prior distributions based
on spline models for the regression functions. We restrict ourselves to the
one-dimensional case here, though for higher dimensions tensor splines can
be used.

The basic assumption on the true mean function and variance function is
that they belong to the Hölder spaces $C^\alpha[0,1]$ and $C^\gamma[0,1]$,
respectively, where $\alpha, \gamma > 0$ could be fractional. The Hölder
space $C^\alpha[0,1]$ consists of all functions that have $\alpha_0$
derivatives, with $\alpha_0$ being the greatest integer less than $\alpha$
and the $\alpha_0$th derivative being Lipschitz of order
$\alpha - \alpha_0$.

Throughout this article, we fix an order $q$, which is a natural number
satisfying $q > \max\{\alpha, \gamma\}$. A B-spline basis function of order
$q$ consists of $q$ polynomial pieces of degree $q - 1$, which are $q - 2$
times continuously differentiable throughout $[0,1]$. To approximate a
function on $[0,1]$, we partition the interval $[0,1]$ into $K_n$
subintervals $((k-1)/K_n, k/K_n]$ for $k = 1, 2, \dots, K_n$, with $\{K_n\}$
being a sequence of natural numbers increasing to infinity as $n$ goes to
infinity. On each subinterval $((k-1)/K_n, k/K_n]$, the function is
approximated by a polynomial of degree strictly less than $q$. The number of
basis functions needed is $J_n = q + K_n - 1$. The basis functions are
denoted by $B_j$, $j = 1, 2, \dots, J_n$. Thus, the space of splines of
order $q$ is a $J_n$-dimensional linear space, consisting of all functions
from $[0,1]$ to $\mathbb{R}$ of the form $g = \sum_{j=1}^{J_n} \beta_j B_j$.

As in [13], the B-splines satisfy (i) $B_j \ge 0$, $j = 1, 2, \dots, J_n$,
(ii) $\sum_{j=1}^{J_n} B_j = 1$, (iii) $B_j$ is supported inside an interval
of length $q/K_n$ and (iv) at most $q$ of $B_1, B_2, \dots, B_{J_n}$ are
nonzero at any given $x$.
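Properties (i), (ii) and (iv), together with the count
$J_n = q + K_n - 1$, can be illustrated with a small sketch. It assumes the
standard Cox-de Boor recursion on an open uniform knot vector (an
implementation choice not spelled out in the text); the function names are
hypothetical, and the half-open interval convention $[a, b)$ used in the
code differs from the text's $(a, b]$ only at the knots.

```python
def bspline_basis(x, q, K):
    """Values B_1(x), ..., B_J(x) of the order-q B-splines on K equal pieces of [0, 1]."""
    # open uniform knot vector: boundary knots repeated q times
    knots = [0.0] * q + [k / K for k in range(1, K)] + [1.0] * q
    # order-1 splines are indicators of the knot intervals;
    # the last nonempty interval is closed at the right endpoint 1
    vals = []
    for i in range(len(knots) - 1):
        a, b = knots[i], knots[i + 1]
        inside = (a <= x < b) or (x == 1.0 and a < b == 1.0)
        vals.append(1.0 if inside else 0.0)
    # Cox-de Boor recursion, raising the order from 2 up to q
    for order in range(2, q + 1):
        new_vals = []
        for i in range(len(knots) - order):
            left_den = knots[i + order - 1] - knots[i]
            right_den = knots[i + order] - knots[i + 1]
            left = (x - knots[i]) / left_den * vals[i] if left_den > 0 else 0.0
            right = (knots[i + order] - x) / right_den * vals[i + 1] if right_den > 0 else 0.0
            new_vals.append(left + right)
        vals = new_vals
    return vals  # length q + K - 1

def spline_value(x, beta, q, K):
    """g_beta(x) = sum_j beta_j B_j(x) for a coefficient list beta of length q + K - 1."""
    return sum(b * v for b, v in zip(beta, bspline_basis(x, q, K)))
```

Because the basis functions are nonnegative and sum to one, `spline_value`
is a convex combination of the coefficients, so its absolute value never
exceeds the largest $|\beta_j|$; this is the fact used later to control the
supremum norm of a spline by its coefficients.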

Let $g$ denote $f$ or $\eta$. We put a prior on $g$ through a prior on
$\beta = (\beta_1, \dots, \beta_{J_n})^T$, the spline coefficients, where
$g$ is represented as $g_\beta(x) = \beta^T B(x)$. Let $\Pi_n^{(g)}$ be the
prior induced by a multivariate normal distribution $N_{J_n}(0, I)$ on the
spline coefficients.

We also assume the regressors are sufficiently regularly distributed, in the
sense that

$$J_n^{-1} \|\beta\|^2 \lesssim \beta^T \Sigma \beta \lesssim J_n^{-1} \|\beta\|^2, \qquad (7)$$

where $\Sigma = (\int B_i B_j\, dQ)$ and $\|\cdot\|$ is the Euclidean norm
on $\mathbb{R}^{J_n}$.

Theorem 1. Assume that $\eta_0 \in C^\alpha[0,1]$,
$V_0 \in C^\gamma[0,1]$ for some $\alpha, \gamma > \frac{1}{2}$; $V_0$ is
bounded away from 0, and (7) holds. Let $\Pi_n^{(\eta)}$ and $\Pi_n^{(f)}$
be priors of $\eta$ and $f$, both induced by $N_{J_n}(0, I)$ on the spline
coefficients. If

$$J_n \sim \min\{(n/\log n)^{1/(1+2\alpha)},\ n^{1/(2+2\gamma)}\},$$

then the posterior converges at the rate

$$\epsilon_n \sim \max\{(n/\log n)^{-\alpha/(1+2\alpha)},\ n^{-\gamma/(2+2\gamma)}\},$$

relative to $d_n$.
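To get a feel for the orders of magnitude in Theorem 1, the two expressions
can be evaluated directly. This is only an illustrative sketch: the theorem
specifies $J_n$ and $\epsilon_n$ up to constants, which the hypothetical
function below simply sets to one.

```python
import math

def spline_rate(n, alpha, gamma):
    """Return (J_n, eps_n) from Theorem 1 with all multiplicative constants set to 1."""
    n_log = n / math.log(n)
    j_n = min(n_log ** (1.0 / (1.0 + 2.0 * alpha)),
              n ** (1.0 / (2.0 + 2.0 * gamma)))
    eps_n = max(n_log ** (-alpha / (1.0 + 2.0 * alpha)),
                n ** (-gamma / (2.0 + 2.0 * gamma)))
    return j_n, eps_n
```

For instance, with $\alpha = \gamma = 1$ the variance term
$n^{-\gamma/(2+2\gamma)} = n^{-1/4}$ dominates for large $n$, so the rate is
driven by the estimation of the variance function.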

More generally, we can view $J_n$ as a sequence of random variables with a
prior distribution. It can be proved that the posterior converges at the
same rate.

Corollary 1. Assume that $\eta_0 \in C^\alpha[0,1]$,
$V_0 \in C^\gamma[0,1]$ for some $\alpha, \gamma > \frac{1}{2}$; $V_0$ is
bounded away from 0, and (7) holds. Let $\Pi_n^{(\eta)}$ and $\Pi_n^{(f)}$
be priors of $\eta$ and $f$, both induced by $N_{J_n}(0, I)$ on the spline
coefficients. Let $J_n$ be a sequence of geometrically distributed random
variables with success probability $p_n$ satisfying
$p_n^{k_n - 1}(1 - p_n) = e^{-n\epsilon_n^2}$, with
$k_n = \lfloor \min\{(n/\log n)^{1/(2\alpha+1)},\ n^{1/(2+2\gamma)}\} \rfloor$
and
$\epsilon_n \sim \max\{(n/\log n)^{-\alpha/(1+2\alpha)},\ n^{-\gamma/(2+2\gamma)}\}$.
Then the posterior convergence rate is $\epsilon_n$, relative to $d_n$.

3.2 Gaussian process prior 

For the higher-dimensional case, we employ prior distributions constructed
by rescaling a smooth Gaussian random field. Let $C[0,1]^d$ denote the space
of all continuous functions defined on $[0,1]^d$. As in [20], we set
$W^{(g)} = (W_x^{(g)} : x \in \mathbb{R}^d)$ to be a centered, homogeneous
Gaussian random field with covariance function of the form, for a given
continuous function $\phi$:

$$E\, W_s^{(g)} W_t^{(g)} = \phi(s - t).$$

To be specific, we choose $W^{(g)} = (W_x^{(g)} : x \in \mathbb{R}^d)$ to be
the squared exponential process, which is the centered Gaussian process with
covariance function

$$E\, W_s^{(g)} W_t^{(g)} = \exp(-\|s - t\|^2),$$

where $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^d$.

Let $A$ be a random variable defined on the same probability space as
$W^{(g)}$ and independent of $W^{(g)}$. Here we assume that $A^d$ possesses
a Gamma distribution. $W^{(g)A}$ denotes the rescaled process
$x \mapsto W_{Ax}^{(g)}$ restricted to $[0,1]^d$, which can be considered as
a Borel measurable map in the space $C[0,1]^d$ equipped with the uniform
norm $\|\cdot\|_\infty$, as shown in [20].
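For a fixed value $A = a$, the rescaled process $x \mapsto W_{ax}^{(g)}$ has
covariance $\exp(-a^2 \|s - t\|^2)$, which follows directly from the
covariance function above. The sketch below builds this covariance matrix at
a finite set of points and draws a rescaling factor. The function names and
the Gamma shape and scale arguments are hypothetical; the text only requires
that $A^d$ be Gamma distributed.

```python
import math
import random

def rescaled_sq_exp_cov(points, a):
    """Covariance matrix of x -> W_{a x} at the given points of [0, 1]^d."""
    def sq_dist(s, t):
        return sum((si - ti) ** 2 for si, ti in zip(s, t))
    # E W_{as} W_{at} = exp(-||a s - a t||^2) = exp(-a^2 ||s - t||^2)
    return [[math.exp(-a * a * sq_dist(s, t)) for t in points] for s in points]

def draw_rescaling(d, shape, scale, rng):
    """Draw A such that A^d ~ Gamma(shape, scale); shape and scale are assumptions."""
    return rng.gammavariate(shape, scale) ** (1.0 / d)
```

A larger value of $a$ shrinks the off-diagonal covariances, so the rescaled
paths vary faster; the random rescaling is what lets a single prior adapt to
different smoothness levels.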

Theorem 2. Assume that $\eta_0 \in C^\alpha[0,1]^d$,
$V_0 \in C^\gamma[0,1]^d$ for some $\alpha, \gamma > \frac{1}{2}$, and that
$V_0$ is bounded away from 0. Let the prior on $g$ (where $g$ denotes $f$ or
$\eta$) be $W^{(g)A}$, the restricted and rescaled squared exponential
process with $A^d$ a Gamma distributed random variable. Then the posterior
converges at the rate

$$\epsilon_n = \max\{n^{-\alpha/(d+2\alpha)} (\log n)^{(d+1)\alpha/(2\alpha+d)},\ n^{-\gamma/(d+2\gamma)} (\log n)^{(d+1)\gamma/(2\gamma+d)}\}$$

relative to $d_n$.

The proof can be found in Section 4. This rate of contraction is not
minimax. By choosing a different prior for $A$, the power
$(d+1)\alpha/(2\alpha+d)$ of the logarithmic factor can be improved.
Although the prior does not depend on $\alpha$ and $\gamma$, the convergence
rate holds for every level of $\alpha$ and $\gamma$. In this sense, it is
rate-adaptive.

If adaptation is not required, or the regularity levels are known, the
minimax rate can be attained by using suitably chosen priors.

Corollary 2. Assume that $\eta_0 \in C^\alpha[0,1]$,
$V_0 \in C^\gamma[0,1]$ for some $\alpha, \gamma > \frac{1}{2}$, and that
$V_0$ is bounded away from 0. For simplicity, we only consider the
one-dimensional situation. Let $W^{(g)}$ be a standard Brownian motion and
$Z_1, \dots, Z_{k_g}$ independent standard normal random variables. Let the
prior on $g$ be the process
$x \mapsto I_{0+}^{k_g} W_x^{(g)} + \sum_{i=1}^{k_g} Z_i x^i / i!$, where
$I_{0+} W$ denotes $x \mapsto \int_0^x W(s)\, ds$ and $I_{0+}^{k} W$ denotes
$I_{0+}(I_{0+}^{k-1} W)$. Then the posterior converges at the rate

$$\max\{n^{-\alpha/(2k_\eta+2)},\ n^{-\gamma/(2k_f+2)}\}.$$

When $\gamma = k_f + 1/2$ and $\alpha = k_\eta + 1/2$,

$$\epsilon_n = \max\{n^{-\alpha/(1+2\alpha)},\ n^{-\gamma/(1+2\gamma)}\},$$

which is the minimax rate.

This example shows that, when $\alpha$ and $\gamma$ are known, we can use
the above specific Gaussian process prior to obtain the minimax rate.
However, this prior is not optimal for all levels of $\alpha$ and $\gamma$:
any other choice of $k_g$ corresponds to an under-smoothed or over-smoothed
prior.

4 The proofs for the main results 

In preparation for the proofs of the main results, we first collect some
lemmas, which are used to bound the average Hellinger distance entropy, the
Kullback-Leibler divergence and the variance divergence in terms of the
$L_2$ norm of the regression functions.

Lemma 1. The average Hellinger distance entropy of the product space
$\Theta_n$ can be bounded by a multiple of the sum of the
$\|\cdot\|_n$-entropies of $\Theta_n^{(\eta)}$ and $\Theta_n^{(f)}$
(recall that $f = \log V$), which means

$$\log N(3\epsilon, \Theta_n, d_n) \le \log N(\epsilon/e^{N_n}, \Theta_n^{(\eta)}, \|\cdot\|_n) + \log N(\epsilon, \Theta_n^{(f)}, \|\cdot\|_n). \qquad (8)$$

With this lemma, the $\epsilon$-covering number relative to the
$d_n$-metric can be estimated by covering numbers relative to the
$L_2$-metric.

Lemma 2. Under the assumption that both $f_1$ and $f_2$ are uniformly
bounded by a constant $N$,

$$K(P_{\eta_1,V_1}, P_{\eta_2,V_2}) \le (1 + e^{2N})\big(\|\eta_1 - \eta_2\|_n^2 + \|f_1 - f_2\|_n^2\big);$$
$$\mathrm{Var}(P_{\eta_1,V_1}, P_{\eta_2,V_2}) \le e^{4N}\big(\|\eta_1 - \eta_2\|_n^2 + \|f_1 - f_2\|_n^2\big).$$

We use this lemma to estimate the prior concentration probability. The
proofs can be found in the Appendix.



4.1 Proof of Theorem 1

We consider the sieve $\Theta_n = \Theta_n^{(f)} \times \Theta_n^{(\eta)}$,
where

$$\Theta_n^{(f)} = \{f_\beta \in \mathrm{supp}\{\Pi_n^{(f)}\},\ \|f_\beta\|_\infty \le N_n\};$$
$$\Theta_n^{(\eta)} = \{\eta_\beta \in \mathrm{supp}\{\Pi_n^{(\eta)}\},\ \|\eta_\beta\|_\infty \le M_n\},$$

where $\mathrm{supp}\{\Pi_n\}$ means the support of $\Pi_n$, and $M_n$,
$N_n$ are sequences of real numbers going to infinity as $n$ goes to
infinity. Since we suppose $\eta_0 \in C^\alpha[0,1]$,
$V_0 \in C^\gamma[0,1]$, and $V_0$ is bounded away from 0, we have
$\log V_0 \in C^\gamma[0,1]$, too. By Lemma 4.1 in [12], there exist
$\beta_\eta, \beta_f \in \mathbb{R}^{J_n}$ (dependent on $n$) such that, for
the true functions $f_0$ and $\eta_0$, the basic approximation property of
splines holds:

$$\|\beta_f^T B - f_0\|_\infty \le A J_n^{-\gamma} \|f_0\|_\gamma;$$
$$\|\beta_\eta^T B - \eta_0\|_\infty \le A' J_n^{-\alpha} \|\eta_0\|_\alpha,$$

where $A$ and $A'$ are constants.

Under assumption (7) of Theorem 1, we can use Euclidean norms on the spline
coefficients to control the $L_2$ distance of functions, since for all
$\beta, \beta' \in \mathbb{R}^{J_n}$,

$$C^{-1} \|\beta - \beta'\| \le \sqrt{J_n}\, \|g_\beta - g_{\beta'}\|_n \le (C')^{-1} \|\beta - \beta'\| \qquad (11)$$

holds for some constants $C$ and $C'$.

We verify all the conditions of the general results on rates of posterior
contraction (e.g. Theorem 4 of [13]), except that the local entropy in
condition (3.2) is replaced by the global entropy
$\log N(\epsilon, \Theta_n, d_n)$, which does not affect the rates. The
parameter $\theta$ in Theorem 4 of [13] is $(\eta, V)$ with
$\theta_0 = (\eta_0, V_0)$.

We start with the estimation of the entropy. We project $g_0$ onto the
$J_n$-dimensional space of splines and denote the projection by
$g_{\beta^{(n)}}$. Using the property of the projection combined with (11),
we have
$\{\beta : \|g_\beta - g_0\|_n \le \epsilon\} \subset \{\beta : \|\beta - \beta^{(n)}\| \le C\sqrt{J_n}\,\epsilon\}$
for every $\epsilon > 0$. For details, please refer to [13]. Thus, we can
use the $C\sqrt{J_n}\,\epsilon$-covering numbers relative to the Euclidean
norm to bound the $\epsilon$-covering number of the set
$\{\beta : \|g_\beta - g_0\|_n \le \epsilon\}$ relative to the $L_2$ norm.
Thus, we have

$$N(\epsilon/3, \Theta_n^{(\eta)}, \|\cdot\|_n) \le N(C\sqrt{J_n}\,\epsilon, \{\beta : \eta_\beta \in \Theta_n^{(\eta)}\}, \|\cdot\|) \le \Big(\frac{K M_n}{\epsilon}\Big)^{J_n}, \qquad (12)$$

where $K$ is a constant; $\eta$ can be replaced by $f$, with $M_n$ replaced
by $N_n$ at the same time. So by Lemma 1 the entropy condition
$\log N(\epsilon_n, \Theta_n, d_n) \le n\epsilon_n^2$ is satisfied, provided
$J_n \log M_n \le n\epsilon_n^2$, $J_n N_n \le n\epsilon_n^2$ and
$J_n \log \epsilon_n^{-1} \le n\epsilon_n^2$.

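Next, we estimate the prior concentration probability around the true
functions, which has the form

$$\Pi_n(B_n((\eta_0, f_0), \epsilon_n; 2)) = \Pi_n\big\{(\eta, V) : K(P_{\eta,V}, P_{\eta_0,V_0}) \le \epsilon_n^2,\ \mathrm{Var}(P_{\eta,V}, P_{\eta_0,V_0}) \le \epsilon_n^2\big\}. \qquad (13)$$

We denote $N/2$ to be $\|f_0\|_\infty$. Under the assumption that
$\|f\|_\infty \le N$ and $\|f_0\|_\infty \le N$, when $n$ is sufficiently
large,

$$\Pi_n(B_n((\eta_0, f_0), \epsilon_n; 2))$$
$$\ge \Pi_n\big\{(\eta, V) : K(P_{\eta,V}, P_{\eta_0,V_0}) \le \epsilon_n^2,\ \mathrm{Var}(P_{\eta,V}, P_{\eta_0,V_0}) \le \epsilon_n^2,\ \|f\|_\infty \le N\big\}$$
$$\ge \Pi_n\big(\|f - f_0\|_n^2 + \|\eta - \eta_0\|_n^2 \le e^{-4N}\epsilon_n^2,\ \|f\|_\infty \le N\big)$$
$$\ge \Pi_n^{(f)}\Big(f : \|f - f_0\|_n^2 \le \frac{e^{-4N}}{2}\epsilon_n^2,\ \|f\|_\infty \le N\Big) \times \Pi_n^{(\eta)}\Big(\eta : \|\eta - \eta_0\|_n^2 \le \frac{e^{-4N}}{2}\epsilon_n^2\Big)$$
$$\ge \Pr\big(\beta : \|\beta - \beta_f^{(n)}\| \le e^{-2N} C' \sqrt{J_n}\,\epsilon_n,\ \|\beta\|_\infty \le N\big) \times \Pr\big(\beta : \|\beta - \beta_\eta^{(n)}\| \le e^{-2N} C' \sqrt{J_n}\,\epsilon_n,\ \|\beta\|_\infty \le N\big)$$
$$\ge \Big(\inf_{\beta_i \in [-2N, 2N]} \phi(\beta_i)\Big)^{2J_n} \mathrm{Vol}\big(\beta : \|\beta - \beta_f^{(n)}\| \le e^{-2N}\epsilon_n\big)\, \mathrm{Vol}\big(\beta : \|\beta - \beta_\eta^{(n)}\| \le e^{-2N}\epsilon_n\big)$$
$$\gtrsim \epsilon_n^{2J_n}, \qquad (14)$$

where $\mathrm{Vol}$ denotes the volume in Euclidean space and
$\inf_{\beta_i \in [-2N, 2N]} \phi(\beta_i)$ is the infimum over the set
$[-2N, 2N]$ of $\phi$, the density function of the standard normal
distribution. The second inequality is derived from Lemma 2. The infimum
$\inf_{\beta_i \in [-2N, 2N]} \phi(\beta_i)$ is a real number bounded away
from zero, which follows from the facts that $\phi$ is nonzero at every
point of $\mathbb{R}$ and continuous, and that $[-2N, 2N]$ is a compact set
in $\mathbb{R}$.

To satisfy the entropy and the prior concentration conditions, we need
$J_n N_n \le n\epsilon_n^2$, $J_n \log M_n \le n\epsilon_n^2$, and
$J_n \log \epsilon_n^{-1} \le n\epsilon_n^2$, together with
$\epsilon_n \ge 2J_n^{-\nu}$, where $\nu = \min\{\alpha, \gamma\}$. When we
set $N_n \sim n^{1/(2\nu+2)}$ and $M_n \sim n$, all of the above conditions
are satisfied, with

$$J_n \sim \min\{(n/\log n)^{1/(1+2\alpha)},\ n^{1/(2+2\gamma)}\}$$

and

$$\epsilon_n \sim \max\{(n/\log n)^{-\alpha/(1+2\alpha)},\ n^{-\gamma/(2+2\gamma)}\}.$$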

It remains to show that the probability assigned by the prior to the
complement of $\Theta_n$ is exponentially small. As we mentioned,
$\eta_\beta(x) = \beta^T B(x)$ for all $x \in [0,1]$, and
$|\sum_{j=1}^{J_n} \beta_j B_j| \le \max_{1 \le j \le J_n} |\beta_j|$. Then
for $t_n > 0$, by Markov's inequality and Chernoff bounds, we have

$$\Pr\Big\{\sup_{x \in [0,1]} \Big|\sum_{j=1}^{J_n} \beta_j B_j(x)\Big| > M_n\Big\} \le J_n \exp\Big(-t_n M_n + \frac{1}{2} t_n^2\Big)\, 2\Phi(t_n), \qquad (15)$$

where $\Phi$ is the standard normal distribution function. By taking
$t_n = M_n$, we have

$$\Pr\Big\{\sup_{x \in [0,1]} \Big|\sum_{j=1}^{J_n} \beta_j B_j(x)\Big| > M_n\Big\} \le J_n \exp\Big(-\frac{M_n^2}{2}\Big). \qquad (16)$$

With $M_n$, $N_n$, $J_n$ and $\epsilon_n$ defined as above, and $n$
sufficiently large,

$$J_n \exp\Big(-\frac{M_n^2}{2}\Big) \le \exp(-n\epsilon_n^2), \qquad (17)$$

and the same bound with $M_n$ replaced by $N_n$ also holds. Thus,

$$\Pi_n(\Theta \setminus \Theta_n) \le \Pi_n^{(\eta)}(\Theta^{(\eta)} \setminus \Theta_n^{(\eta)}) + \Pi_n^{(f)}(\Theta^{(f)} \setminus \Theta_n^{(f)})$$
$$= \Pr\Big\{\sup_{x \in [0,1]} \Big|\sum_{j} \beta_j B_j(x)\Big| > M_n\Big\} + \Pr\Big\{\sup_{x \in [0,1]} \Big|\sum_{j} \beta_j B_j(x)\Big| > N_n\Big\}$$
$$\lesssim \exp(-n\epsilon_n^2).$$

This completes the proof.
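The tail bound behind (15) can be checked numerically for a single standard
normal coefficient: for $M \ge 0$ and $t > 0$,
$\Pr(|Z| > M) \le 2 \exp(-tM + t^2/2)\,\Phi(t)$, and a union bound over the
$J_n$ coefficients gives the extra factor $J_n$. The sketch below uses only
the standard library; the function names are illustrative and not taken from
the paper.

```python
import math

def std_normal_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def two_sided_tail(m):
    """Exact Pr(|Z| > m) for a standard normal Z and m >= 0."""
    return math.erfc(m / math.sqrt(2.0))

def chernoff_bound(m, t):
    """Right-hand side of (15) for a single coefficient (J_n = 1)."""
    return 2.0 * math.exp(-t * m + 0.5 * t * t) * std_normal_cdf(t)

def sieve_complement_bound(j_n, m):
    """Union bound over j_n coefficients with the choice t = m."""
    return j_n * chernoff_bound(m, m)
```

With the choice $t = M$ the bound becomes $2\Phi(M)\exp(-M^2/2)$, which has
the Gaussian decay $\exp(-M^2/2)$ needed to dominate
$\exp(-n\epsilon_n^2)$ when $M_n$ grows fast enough.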

Remark 1. When we generalize the priors of $\eta$ and $f$ induced through
the spline coefficients, subject to some restrictions, the convergence rate
stays unchanged. We assume the same prior $\Pi$ on each
$\beta_j \in \mathbb{R}$, $j = 1, \dots, J_n$, with density function
$d(\beta_j) \in C[\mathbb{R}]$ (the set of continuous functions), which
satisfies

$$\Pi(|\beta_j| > M) \le \ldots,$$

$$d(\beta_j = r) \ne 0 \text{ for any } r \in \mathbb{R},$$

where $p$ is a real number larger than 1. The normal distribution can be
viewed as a special case satisfying these conditions. Then, with
$\epsilon_n$, $M_n$, $N_n$ and $J_n$ defined as above, the bound (18) on the
sieve complement is not affected, since

$$\Pr\Big\{\sup_{x \in [0,1]} \Big|\sum_{j} \beta_j B_j(x)\Big| > M_n\ (N_n, \text{ resp.})\Big\} \le J_n \exp(-\ldots) \le \exp(-n\epsilon_n^2). \qquad (20)$$

The prior concentration probability can also be bounded below by a multiple
of the volume of a Euclidean ball. Together with the fact that the priors do
not affect the entropy, this shows that the convergence rate remains
unchanged when we generalize the priors.

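4.2 Proof of Corollary 1

The proof is almost the same as that of Theorem 1. We consider the sieves
$\Theta_n = \Theta_n^{(f)} \times \Theta_n^{(\eta)}$ of the form

$$\Theta_n^{(f)} = \{f_\beta \in \mathrm{supp}\{\Pi_n^{(f)}\},\ \|f_\beta\|_\infty \le N_n,\ J_n \le k_n\};$$
$$\Theta_n^{(\eta)} = \{\eta_\beta \in \mathrm{supp}\{\Pi_n^{(\eta)}\},\ \|\eta_\beta\|_\infty \le M_n,\ J_n \le k_n\},$$

where
$k_n = \lfloor \min\{(n/\log n)^{1/(2\alpha+1)},\ n^{1/(2+2\gamma)}\} \rfloor$
and $\lfloor \cdot \rfloor$ denotes the integer part. By (12), the
$\epsilon_n$-covering number of $\Theta_n$ is bounded by a multiple of
$(M_n e^{N_n}/\epsilon_n)^{J_n} \times (N_n/\epsilon_n)^{J_n}$, which has
been shown to be bounded by a multiple of $e^{n\epsilon_n^2}$ whenever
$J_n \le \lfloor \min\{(n/\log n)^{1/(2\alpha+1)},\ n^{1/(2+2\gamma)}\} \rfloor$,
$M_n \sim n$, $N_n \sim n^{1/(2\nu+2)}$ and
$\epsilon_n \sim \max\{(n/\log n)^{-\alpha/(1+2\alpha)},\ n^{-\gamma/(2+2\gamma)}\}$.

The prior concentration probability (13) can be estimated as

$$\Pi_n(B_n((\eta_0, f_0), \epsilon_n; 2))$$
$$= \sum_{k=1}^{\infty} \Pr(J_n = k)\, \Pi_n\big(B_n((\eta_0, f_0), \epsilon_n; 2) \mid J_n = k\big)$$
$$\ge \Pr(J_n = k_n) \Big(\inf \phi(\beta_i)\Big)^{2k_n} \mathrm{Vol}\big(\beta : \|\beta - \beta_f^{(n)}\| \le e^{-2N} C' \epsilon_n\big)\, \mathrm{Vol}\big(\beta : \|\beta - \beta_\eta^{(n)}\| \le e^{-2N} C' \epsilon_n\big)$$
$$\gtrsim \Pr(J_n = k_n)\, \epsilon_n^{2k_n}.$$

With the assumption on $p_n$ and the fact, already proved, that
$\epsilon_n^{2J_n} \ge e^{-n\epsilon_n^2}$ for
$J_n \sim \min\{(n/\log n)^{1/(1+2\alpha)},\ n^{1/(2+2\gamma)}\}$ and
$\epsilon_n \sim \max\{(n/\log n)^{-\alpha/(1+2\alpha)},\ n^{-\gamma/(2+2\gamma)}\}$,
we can guarantee

$$p_n^{k_n - 1}(1 - p_n)\, \epsilon_n^{2k_n} \ge e^{-2n\epsilon_n^2}.$$

We compute the probability of $(\Theta_n^{(\eta)})^c$ as follows:

$$\Pi_n((\Theta_n^{(\eta)})^c)$$
$$= \sum_{k=1}^{k_n} \Pr(J_n = k) \Pr\Big\{\sup_{x \in [0,1]} \Big|\sum_{j=1}^{k} \beta_j B_j(x)\Big| > M_n\Big\} + \sum_{k=k_n+1}^{\infty} \Pr(J_n = k)$$
$$\le \sum_{k=1}^{k_n} \Pr(J_n = k)\, k \exp\Big(-\frac{M_n^2}{2}\Big) + \sum_{k=k_n+1}^{\infty} \Pr(J_n = k)$$
$$\le k_n \exp\Big(-\frac{M_n^2}{2}\Big) + \sum_{k=k_n+1}^{\infty} \Pr(J_n = k)$$
$$\lesssim e^{-n\epsilon_n^2}.$$

The last inequality follows from the fact that
$k_n \exp(-M_n^2/2) \le e^{-n\epsilon_n^2}$ and the assumption
$p_n^{k_n - 1}(1 - p_n) = e^{-n\epsilon_n^2}$.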

4.3 Proof of Theorem 2

We denote $\kappa$ to be $\alpha$ or $\gamma$. By Theorem 3.1 in [20], there
exists a Borel measurable subset $B_n^{(g)}$ of $C[0,1]^d$ such that

$$\Pr(\|W^{(g)A} - g_0\|_\infty \le \epsilon_n) \ge e^{-n\epsilon_n^2};$$
$$\Pr(W^{(g)A} \notin B_n^{(g)}) \le e^{-4n\epsilon_n^2}; \qquad (21)$$
$$\log N(\epsilon_n, B_n^{(g)}, \|\cdot\|_\infty) \le K^{(g)} n\epsilon_n^2$$

hold for every sufficiently large $n$, where
$\epsilon_n = n^{-\kappa/(2\kappa+d)} (\log n)^{(d+1)\kappa/(2\kappa+d)}$
and $K^{(g)}$ is a sufficiently large constant. As stated in [20], the power
of the logarithm can be improved by using a slightly different prior for
$A$. The final rate of contraction is then improved as well, as can be seen
from the following proof.

We set $\Theta_n$ in the following way. Denote
$\Theta_n^{(f)} = \{W^{(f)A} \in B_n^{(f)} \text{ and } \|W^{(f)A}\|_\infty \le N_n\}$.
Then $\Theta_n^{(f)}$ increases to $B_n^{(f)}$ as $N_n$ increases to
infinity. As we assumed, $\{N_n\}$ is a sequence of real numbers increasing
to infinity. We choose $N_n$ satisfying

$$\Pr(W^{(f)A} \in B_n^{(f)}) - \Pr(W^{(f)A} \in \Theta_n^{(f)}) \le e^{-4n\epsilon_n^2}.$$

Then

$$\Pr(W^{(f)A} \notin \Theta_n^{(f)}) \le 2e^{-4n\epsilon_n^2}.$$

This can be achieved, since $\Pr(W^{(f)A} \in B_n^{(f)})$ goes to 1 and
$\epsilon_n$ goes to zero. Then we set
$\Theta_n = B_n^{(\eta)} \times \Theta_n^{(f)} \subset C[0,1]^d \times C[0,1]^d$.

We now verify all the conditions of the general results on rates of
posterior contraction. First, we bound the average Hellinger distance
entropy of the sieve of the parameter space:

$$\log N(\epsilon_n, \Theta_n, d_n)$$
$$\le \log N(\epsilon_n/e^{N_n}, B_n^{(\eta)}, \|\cdot\|_n) + \log N(\epsilon_n, \Theta_n^{(f)}, \|\cdot\|_n)$$
$$\le \log N(\epsilon_n/e^{N_n}, B_n^{(\eta)}, \|\cdot\|_\infty) + \log N(\epsilon_n, B_n^{(f)}, \|\cdot\|_\infty)$$
$$\le K n\epsilon_n^2.$$

The first inequality follows from Lemma 1, and the last one from the third
inequality of (21).

To estimate the prior positivity, we again use Lemma 2. With the assumption
that $\|f\|_\infty \le N_n$ and $\|f_0\|_\infty \le N_n$, for sufficiently
large $n$, we get

$$\Pi_n(B_n((\eta_0, f_0), \epsilon_n; 2)) \ge \Pi_n\big(\|f - f_0\|_n^2 + \|\eta - \eta_0\|_n^2 \le e^{-4N_n}\epsilon_n^2\big)$$
$$\ge \Pi_n^{(f)}\Big(f : \|f - f_0\|_n^2 \le \frac{e^{-4N_n}}{2}\epsilon_n^2\Big) \times \Pi_n^{(\eta)}\Big(\eta : \|\eta - \eta_0\|_n^2 \le \frac{e^{-4N_n}}{2}\epsilon_n^2\Big)$$
$$\ge \Pr\Big(\|W^{(f)A} - f_0\|_\infty \le \frac{e^{-2N_n}}{\sqrt{2}}\epsilon_n\Big) \times \Pr\Big(\|W^{(\eta)A} - \eta_0\|_\infty \le \frac{e^{-2N_n}}{\sqrt{2}}\epsilon_n\Big)$$
$$\ge e^{-2n\epsilon_n^2}.$$

Thus, for $\Theta_n \subset C[0,1]^d \times C[0,1]^d$ defined above, and

$$\epsilon_n = \max\{n^{-\alpha/(d+2\alpha)} (\log n)^{(d+1)\alpha/(2\alpha+d)},\ n^{-\gamma/(d+2\gamma)} (\log n)^{(d+1)\gamma/(2\gamma+d)}\},$$

we have proved

$$\log N(\epsilon_n, \Theta_n, d_n) \le 2Kn\epsilon_n^2;$$
$$\Pi_n(B_n((\eta_0, f_0), \epsilon_n; 2)) \ge e^{-2n\epsilon_n^2};$$
$$\Pi_n((f, \eta) \notin \Theta_n) \le 3e^{-4n\epsilon_n^2}.$$

The three assertions match one-to-one the assumptions of the general results
on rates of posterior contraction (e.g. Theorem 4 in [12]), so the proof is
complete.

The proof of Corollary 2 is almost the same, except that the value of
$\epsilon_n$ is given by Theorem 4.1 of [18].

5 Discussion 

In this paper, we investigated the posterior convergence rate for the
heteroscedastic nonparametric regression model in which both the mean
function and the variance function are unknown and nonparametric. We
considered both random and deterministic covariates, and also treated the
high-dimensional case. Although the rates we gave are not minimax, they
differ from the optimal ones only by a logarithmic factor. Moreover, they
hold for every regularity level. We also gave the minimax rate under the
condition that the regularity level is known.

Whether the logarithmic factor in the posterior convergence rate is
necessary when the regularity level is unknown is an open question. To
investigate this problem, other kinds of priors must be used: as van der
Vaart and van Zanten conjectured in [20], the logarithmic factor is
necessary with the rescaled Gaussian random field prior, and the method used
in our section on splines cannot give the desired result either.

6 Appendix A. Proof of Lemma 1 

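By applying the inequality $2 - 2ab \le (2 - 2a) + (2 - 2b)$ for
$0 \le a \le 1$ and $0 \le b \le 1$, together with $1 - e^{-x} \le x$ for
$x \ge 0$ and $1 - \sqrt{2x/(1+x^2)} \le (2\log x)^2$ for all $x > 0$, we
have

$$2 - 2\exp\Big(-\frac{(\eta_1(x) - \eta_2(x))^2}{4(V_1(x) + V_2(x))}\Big) \sqrt{\frac{2\sqrt{V_1(x)V_2(x)}}{V_1(x) + V_2(x)}}$$
$$\le 2\Big(1 - \sqrt{\frac{2\sqrt{V_1(x)V_2(x)}}{V_1(x) + V_2(x)}}\Big) + 2\Big(1 - \exp\Big\{-\frac{(\eta_1(x) - \eta_2(x))^2}{4(V_1(x) + V_2(x))}\Big\}\Big),$$

where the left-hand side is the inner integral in

$$d_n^2(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = \int\!\!\int \big(\sqrt{p_{\eta_1,V_1}} - \sqrt{p_{\eta_2,V_2}}\big)^2\, dy\, dQ.$$

Thus, we have

$$d_n^2(P_{\eta_1,V_1}, P_{\eta_2,V_2}) \le 2\int (f_1(x) - f_2(x))^2\, dQ + 2\int \frac{(\eta_1(x) - \eta_2(x))^2}{4(V_1(x) + V_2(x))}\, dQ,$$

which yields

$$\log N(3\epsilon, \Theta_n, d_n) \le \log N(\epsilon/e^{N_n}, \Theta_n^{(\eta)}, \|\cdot\|_n) + \log N(\epsilon, \Theta_n^{(f)}, \|\cdot\|_n),$$

provided $V_i \ge e^{-N_n}$, $i = 1, 2$.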
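The pointwise Hellinger bound used in this proof can be illustrated
numerically. The sketch below assumes the standard closed form of the
Hellinger affinity between two normal distributions and compares the
resulting squared distance with the bound
$2(f_1 - f_2)^2 + (\eta_1 - \eta_2)^2 / (2(V_1 + V_2))$, where
$f_i = \log V_i$. The function names and the quadrature settings are
illustrative choices.

```python
import math

def hellinger_sq(eta1, v1, eta2, v2):
    """Closed-form integral of (sqrt(p1) - sqrt(p2))^2 for N(eta1, v1), N(eta2, v2)."""
    affinity = (math.sqrt(2.0 * math.sqrt(v1 * v2) / (v1 + v2))
                * math.exp(-(eta1 - eta2) ** 2 / (4.0 * (v1 + v2))))
    return 2.0 - 2.0 * affinity

def hellinger_sq_numeric(eta1, v1, eta2, v2, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of the same integral (hypothetical helper)."""
    sd = math.sqrt(max(v1, v2))
    lo = min(eta1, eta2) - half_width * sd
    hi = max(eta1, eta2) + half_width * sd
    h = (hi - lo) / steps

    def sqrt_pdf(y, eta, v):
        # square root of the N(eta, v) density
        return math.exp(-(y - eta) ** 2 / (4.0 * v)) / (2.0 * math.pi * v) ** 0.25

    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        total += (sqrt_pdf(y, eta1, v1) - sqrt_pdf(y, eta2, v2)) ** 2 * h
    return total

def appendix_a_bound(eta1, v1, eta2, v2):
    """Pointwise bound 2 (f1 - f2)^2 + 2 * (eta1 - eta2)^2 / (4 (v1 + v2))."""
    f_diff = math.log(v1) - math.log(v2)
    return 2.0 * f_diff ** 2 + (eta1 - eta2) ** 2 / (2.0 * (v1 + v2))
```

When $V_1 + V_2$ is small the mean term of the bound is large, which is why
the lower bound $V_i \ge e^{-N_n}$ enters the covering radius
$\epsilon / e^{N_n}$ for the mean function.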

7 Appendix B. Proof of Lemma 2 

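For the Kullback-Leibler divergence, we have

$$K_x(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = \frac{1}{2}\log\frac{V_2}{V_1} - \frac{1}{2}\Big(1 - \frac{V_1}{V_2}\Big) + \frac{(\eta_1 - \eta_2)^2}{2V_2}$$
$$= \frac{1}{2}\Big[(f_2(x) - f_1(x)) - 1 + e^{-(f_2(x) - f_1(x))}\Big] + \frac{(\eta_1(x) - \eta_2(x))^2}{2V_2(x)}.$$

We know that, for $|z| \le 2N$,

$$|z - 1 + e^{-z}| \le |z| + |e^{-z} - 1| \le (e^{2N} + 1)|z|;$$

when $|z| \ge 1$,

$$(e^{2N} + 1)|z| \le (e^{2N} + 1)z^2,$$

and when $|z| < 1$,

$$|z - 1 + e^{-z}| \le \sum_{n=2}^{\infty} \frac{|z|^n}{2} \le \frac{|z|^2/2}{1 - |z|} \le (e^{2N} + 1)z^2.$$

Thus:

$$K(P_{\eta_1,V_1}, P_{\eta_2,V_2}) \le (1 + e^{2N})\big(\|\eta_1 - \eta_2\|_n^2 + \|f_1 - f_2\|_n^2\big).$$

For the variance divergence, we have

$$\mathrm{Var}_x(P_{\eta_1,V_1}, P_{\eta_2,V_2}) = 2\Big[-\frac{1}{2} + \frac{1}{2}\frac{V_1(x)}{V_2(x)}\Big]^2 + \Big[\frac{\sqrt{V_1(x)}}{V_2(x)}\big(\eta_1(x) - \eta_2(x)\big)\Big]^2.$$

We can finish the proof with the inequality
$|1 - e^z|^2 \le (e^{2N})^2 z^2$ for $|z| \le 2N$.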
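The scalar inequality $|z - 1 + e^{-z}| \le (e^{2N} + 1) z^2$ for
$|z| \le 2N$, and the resulting pointwise Kullback-Leibler bound of Lemma 2,
can be spot-checked on a grid. The sketch below is illustrative only: the
function names are hypothetical, and $K_x$ is written in terms of
$f = \log V$ as in the display above.

```python
import math

def kl_in_f(eta1, f1, eta2, f2):
    """K_x expressed through f = log V: 0.5 [z - 1 + exp(-z)] + (eta1 - eta2)^2 / (2 V2)."""
    z = f2 - f1
    # z + expm1(-z) equals z - 1 + exp(-z), computed without cancellation
    return 0.5 * (z + math.expm1(-z)) + (eta1 - eta2) ** 2 / (2.0 * math.exp(f2))

def lemma2_kl_bound(eta1, f1, eta2, f2, big_n):
    """Pointwise version of the Lemma 2 bound, valid when |f1|, |f2| <= big_n."""
    return (1.0 + math.exp(2.0 * big_n)) * ((eta1 - eta2) ** 2 + (f1 - f2) ** 2)

def scalar_inequality_holds(z, big_n):
    """Check |z - 1 + exp(-z)| <= (exp(2 N) + 1) z^2 at a single point."""
    return abs(z + math.expm1(-z)) <= (math.exp(2.0 * big_n) + 1.0) * z * z
```

Integrating the pointwise bound against $Q$ gives the averaged statement of
Lemma 2 in the $\|\cdot\|_n$ norm.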